Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


info (sample ID, group, library size (lib.size), and normalization factors (norm.factors), which are initially set to 1). This information will be updated in later steps, and more slots will be added to the object as well.

5.3.7.2 Annotation

The row names of the count data frame are gene symbols, as shown in Figure 5.5. Some downstream analyses require the count data to be annotated with NCBI Entrez IDs and full gene names, which the count dataset does not include at this point. To add these annotations to the DGEList object (y), we need to use the Entrez IDs as the row names instead of the gene symbols. To obtain the Entrez IDs and gene names, we need to install and load the “org.Hs.eg.db” Bioconductor package, which provides genome-wide annotation for human, based on mapping using Entrez Gene identifiers [34]. You can install and load this package by running the following script at the R prompt:

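# Install BiocManager first if it is not already available
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install the human annotation package from Bioconductor, then load it
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)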
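As an illustrative sketch of the lookup described above, the following code maps a few gene symbols to Entrez IDs and full gene names using the mapIds() function of the AnnotationDbi package, on which org.Hs.eg.db depends. The toy count matrix and the variable names (counts, entrez, gene_name, counts_entrez, annotation) are hypothetical. Dropping unmapped or duplicated IDs is only one possible choice and is not necessarily the approach used in the rest of this section.

```r
# Assumes org.Hs.eg.db has already been installed as shown above.
library(org.Hs.eg.db)

# Hypothetical toy count matrix; row names are gene symbols, as in Figure 5.5.
counts <- matrix(
  c(10L, 0L, 25L, 12L, 3L, 30L),
  nrow = 3,
  dimnames = list(c("TP53", "BRCA1", "EGFR"), c("sample1", "sample2"))
)

# Look up the Entrez ID and the full gene name for each symbol.
# multiVals = "first" keeps a single match when a symbol has several.
entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, keys = rownames(counts),
                                column = "ENTREZID", keytype = "SYMBOL",
                                multiVals = "first")
gene_name <- AnnotationDbi::mapIds(org.Hs.eg.db, keys = rownames(counts),
                                   column = "GENENAME", keytype = "SYMBOL",
                                   multiVals = "first")

# Keep only symbols that mapped to a unique Entrez ID (an assumption of this
# sketch), then use the Entrez IDs as the new row names.
keep <- !is.na(entrez) & !duplicated(entrez)
counts_entrez <- counts[keep, , drop = FALSE]
rownames(counts_entrez) <- entrez[keep]

# Annotation table that could be stored alongside the counts.
annotation <- data.frame(
  EntrezID = unname(entrez[keep]),
  Symbol   = rownames(counts)[keep],
  GeneName = unname(gene_name[keep]),
  stringsAsFactors = FALSE
)
```

A table such as annotation could then be attached to the DGEList object (y), for example through its genes slot, so that each row of counts carries its Entrez ID, symbol, and gene name.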

FIGURE 5.5 The data frame after adding row and column names and removing rows with all zeros.

FIGURE 5.6 The DGEList object of the count data.